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Abstract. The use of composable abstractions allows the application of new and established 
algorithms to a wide range of problems while automatically inheriting the benefits of well-known 
performance optimisations. This work highlights the composition of the PETSc DMPlex domain 
topology abstraction with the Firedrake automated finite element system to create a PDE solving 
environment that combines expressiveness, flexibility and high performance. We describe how Fire¬ 
drake utilises DMPlex to provide the indirection maps required for finite element assembly, while 
supporting various mesh input formats and runtime domain decomposition. In particular, we de¬ 
scribe how DMPlex and its accompanying data structures allow the generic creation of user-defined 
discretisations, while utilising data layout optimisations that improve cache coherency and ensure 
overlapped communication during assembly computation. 
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1. Introduction. The separation of model description from implementation fa¬ 
cilitates multi-layered software stacks consisting of highly specialised components that 
allow performance optimisation to happen at multiple levels, ranging from global data 
layout transformations to local kernel optimisations. A key challenge in designing such 
multi-layered systems is the choice of abstractions to employ, where a high degree of 
specialisation needs to be complemented with the generality required to facilitate the 
utilisation of third-party libraries and thus promote code reuse. The use of high-level 
domain-specific languages (DSL) and composable abstractions allows existing algo¬ 
rithms and optimisations to be inserted into this hierarchical framework, and applied 
to a much wider range of problems. 

In this paper we describe the integration of the DMPlex mesh topology abstrac¬ 
tion provided by the PETSc library [2] with Firedrake, a generalised system for the 
automation of the solution of partial differential equations using the Finite Element 
method (FEM) [20]. We outline how DMPlex is utilised in Firedrake to provide 
the required mapping between topological entities and degrees of freedom (DoFs), 
while supporting various mesh input formats, run-time domain decomposition and 
mesh renumbering techniques. In particular, we describe how DMPlex and its ac¬ 
companying data structures allow the generic creation of user-defined discretisations, 
while utilising data layout optimisations that optimise cache coherency and ensure 
computation-communication overlap during finite element assembly. 

2. Background. 

2.1. Firedrake. Firedrake is a novel tool for the automated solution of Finite 
Element problems defined in the Unified Form Language (UFL) [I], a domain-specific 
language (DSL) for the specification of partial differential equations in weak form pi¬ 
oneered by the FEniCS project [18]. Firedrake imposes a clear separation of concerns 
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between the definition of the problem, the local discretisation defining the computa¬ 
tional kernel used to compute the solution and the parallel execution of this kernel over 
a given data set [20]. These multiple layers of abstraction allow various types of op¬ 
timisation to be applied during the solution process, ranging from high-level caching 
of mathematical forms to compiler-level optimisations that leverage threading and 
vectorisation intrinsics within the assembly kernels. 

A key component to achieving performance in Firedrake is PyOP2, a high-level 
framework that optimises the parallel execution of numerical kernels over unstructured 
mesh data [21]. PyOP2 represents mesh entities as sets and connectivity between them 
as mappings, where input data to the compiled kernel is either accessed directly or 
indirectly via a mapping. In parallel PyOP2 is able to overlap halo data communi¬ 
cation with kernel computation during the execution loop due to a specialised data 
ordering within sets [19]. 

2.2. DMPlex. PETSc’s ability to manage unstructured meshes is centred around 
DMPlex, a data management object that encapsulates the topology of unstructured 
grids and provides a wide range of common mesh management functionalities to ap¬ 
plication programmers [16]. As such DMPlex provides a domain topology abstraction 
that decouples user applications from the implementation details of common mesh- 
related utility tasks, such as file I/O, domain decomposition methods and parallel 
load balancing [15], which increases extensibility and improves interoperability be¬ 
tween scientific applications through librarization [5]. 

DMPlex uses an abstract representation of the unstructured meshes in memory, 
where the connectivity of topological entities is stored as a directed acyclic graph 
(DAG) [14, 17]. The DAG is constructed of clearly defined layers (strata) that enable 
access to mesh entities by their topological dimension or co-dimension, enabling appli¬ 
cation codes to be written without explicit reference to the topological dimension of 
the mesh. As illustrated in Fig. lb, all points in the topology DAG share a single con¬ 
secutive entity numbering, emphasizing that each point is treated equally no matter its 
shape or dimension, and allowing DMPlex to store the graph connectivity in a single 
array where dimensional layers are defined as consecutively numbered sub-ranges. The 
directional connectivity of the DAG is defined by the covering relationship cone{p), 
which denotes all points directly connected to p in the next codimension, as illustrated 
in Fig. Ic. The transitive closure of the cone operation is denoted as closure(p), and 
depicted in Fig. Id. The dual operation, support(p), and it’s transitive closure star(p) 
are shown in Fig. le and Fig. If respectively. 

In addition to the abstract topology data, PETSc provides two utility objects to 
describe the parallel data layout: a Section object maps the graph-based topology 
information to discretised solution data through an offset mapping very similar to the 
Compressed Sparse Row (GSR) storage scheme, and the Star Forest [3] (SF) object 
holds a one-sided description of shared data in parallel. These data layout mappings 
allow DMPlex to manage distributed solution data by automating the preallocation 
of distributed vector and matrix data structures and performing halo data exchanges. 
Moreover, by storing grid topology alongside discretised solution data, DMPlex is able 
provide the mappings required for sophisticated preconditioning algorithms, such as 
geometric multigrid methods [6] and multi-block, or “Fieldsplit”, preconditioning for 
multi-physics problems [4]. 

2.3. Mesh Reordering Techniques. The run-time performance of geometry- 
based processing algorithms can be significantly affected by the data layout of unstruc¬ 
tured meshes and sparse matrices due to caching effects. A number of mesh ordering 
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(a) Vertex and edge numbering (b) Connectivity of entities in a DAG 



(c) cone{5) = 
{9,10,11} 



(d) closure{5) — 
{1,2,3,9,10,11} 



Fig. 1: Example entity numbering for a single tetrahedron and the corresponding 
internal DAG. Entities are numbered accross stratified layers (dimensions) with a 
consecutive numbering in each stratum. 


techniques exist that aim to increase the cache coherency of local data, either through 
cache-aware or cache-oblivious reordering [22, 11, 12]. Cache-oblivious techniques aim 
to reduce the bandwidth of the resulting sparse matrix and thus lower the number of 
cache misses incurred when traversing local data regardless of the underlying caching 
architecture. 

The Reverse Cuthill-McKee (RCM) algorithm [7, 9] represents a classic exam¬ 
ple of a cache-oblivious mesh reordering. RCM is based on a variant of a simplex 
breadth-first search of the mesh connectivity graph and yields a fixed-size n tuple 
that represents the new ordering permutation. Alternative methods, such as space 
filling curve numberings, may be used to create similar permutations from a given 
mesh topology graph in order to further increase cache coherency. 
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3. Computational Meshes in Firedrake. The Firedrake system comprises a 
stack of specialised components that implement a set of multi-layered abstractions 
to provide automated finite element computation from a high-level specification [20]. 
The role of the top-level Firedrake layer is to marshall data between the various sub¬ 
components and to provide the computation layers, PyOP2 and PETSc, with the 
maps and data objects required to assemble and solve linear and non-linear systems. 
The computational mesh is encapsulated in a Mesh object that can either be read from 
file or generated in memory for common geometry classes, such as squares, cubes or 
spheres. 

A characteristic feature of the Firedrake execution stack is that multiple discreti¬ 
sations of the same computational domain, represented by the FunctionSpace class, 
may be derived dynamically at any point during execution, which requires the topo¬ 
logical connectivity of the mesh to be stored in a separate object. Separating mesh 
topology from the discretisation of the problem not only enables Firedrake to exploit 
caching and data re-use with minimal replication at multiple levels in the tool chain 
but also allows data layout optimisations to be inherited for all derived discretisations 
without re-computation of the mesh reordering scheme. 

As shown in Fig. 2 the Firedrake classes Mesh and FunctionSpace, which encap¬ 
sulate mesh topology and problem discretisation respectively, map naturally onto 
the abstractions provided by PETSc’s data management API. The Mesh class encap¬ 
sulates the topological connectivity of the grid by storing a DMPlex object along¬ 
side a Firedrake-specific application ordering, while discretisation data given by the 
FunctionSpace class defines the layout of local data stored in the Function object. 
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Fig. 2: Mapping of data abstractions between Firedrake and PETSc: Firedrake’s 
Mesh object encapsulates domain topology stored in a DMPlex object alongside an 
application numbering permutation. The choice of FunctionSpace defines the local 
data discretisation via a PetscSection that is used to generate the indirection maps 
required by PyOP2 for assembly computation. Halo communication is performed by a 
PetscSF object, which encapsulates the mapping between local and remote data items 
in the local Vec. 
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3.1. Mesh topology. Firedrake uses the DMPlex data management abstraction 
as an internal representation of mesh topology, allowing it to delegate file I/O and 
run-time mesh generation to PETSc. In doing so Firedrake only depends on the 
public API provided by PETSc and automatically inherits the mesh management 
and manipulation capabilities provided by DMPlex. As a result Firedrake naturally 
supports the same set of mesh hie formats as DMPlex, which at the time of writing 
includes ExodusII, Gmsh, CGNS, MED and Fluent Gase hies, and thus increases 
interoperability with other applications and provides extensibility through a well- 
supported public library. 

In addition to various mesh format readers DMPlex also provides parallel do¬ 
main decomposition routines that interface with external libraries, such as Chaco and 
Metis/Par Metis, to facilitate parallel partitioning of the topology graph. Utilising 
PETSc’s internal communication routines DMPlex is thus capable of automatically 
distributing the mesh across any number of processes, which allows Firedrake to 
fully automate the parallelisation and optimisation of the user-dehned Finite Element 
problem. 

Another advantage of using the DMPlex DAG as an intermediate representation 
of mesh topology is that the abstracted graph format allows Firedrake to dictate the 
ordering of the mesh topology and thus control local data layout of derived discretisa¬ 
tions. This is made possible by attaching a point permutation to the DMPlex object, 
which defines a single level of indirection that is applied to all graph traversal opera¬ 
tions within DMPlex. As a results, all discretisation objects derived from the stored 
topology inherit this permutation, giving Firedrake an effective way to control the 
global ordering of derived solution data. 

3.2. Discretisation. The FEniCS language (UFL [I]) implemented by Fire¬ 
drake allows the use of various discretisation schemes to represent solution data, 
where the number of DoFs associated with each mesh entity is determined by the 
local discretisation within a reference element. The FIAT package [13] of the FEniGS 
software stack provides this reference element from which Firedrake needs to derive 
the indirection maps between mesh cells and DoFs required by PyOP2 to perform 
matrix and vector assembly. 

The mapping from mesh topology to solution data is facilitated by PETSc through 
PetscSection, a class of descriptor objects that store a CSR-style mapping between 
points in the topology DAG and entries in array or vector objects. Assuming a 
constant element type throughout the mesh, DMPlex can generate a section object, 
given the number of DoFs associated with each mesh entity type as provided by the 
FIAT reference element. The set of DoFs associated with a cell can then be derived 
by taking the closure of the cell point (see Fig. Id) and collecting the DoFs associated 
with them by the provided section. 

The use of DMPlex closures to determine entity-to-DoF mappings is sufficient 
on its own should the local numbering of mesh entities within a cell closure match 
that required by the application. In Firedrake the local numbering on simplices must 
match the simplex numbering used in FEniCS [18], where the local facet number is 
determined by the local number of the opposite vertex. The algorithm shown in Alg. 1 
is thus applied to each cell closure in turn to enforce the desired local numbering for 
simplices. 

3.3. Halo communication. The exchange of halo data between processors in 
Firedrake is performed by PETSc’s Star Forest [3] (SF) communication abstraction 
that encapsulates one-sided description of shared data. SF objects implement a range 


6 


M. Lange et al. 


Algorithm 1 Local numbering algorithm for simplex elements 
1; for cell in mesh do > Loop over all cells in the mesh 

2: closurecell ^ DMPLEXGETCLOSURE(p^ea;, cell) 

3: for p in closurecell do t> Filter facets and vertices from cell closure 

4: if p in DEPTHSTRATUM(plea:, 0) then vertices ^ p 

5: if p in HEiGHTSTRATUM(pZea;, 1) then facets ■<r- p 

6: SORT(vertices) > Sort vertices by global number 

7: for facet in facets do 

8: closure facet DMPLEXGETCLOSURE(p/ea;,/ocet) 

9: for / in closure facet do > Filter vertices from facet closure 

10; if / in DEPTHSTRATUM(plea;, 0) then vfacet ^ / 

11; for V in vertices do > Find non-adjacent vertices 

12; if V not in v facet then keys ^ {facet, v) 

13; SORT{facets, keys) > Sort facets by non-adjacent vertices 


of sparse communication patterns that are able to perform common data communi¬ 
cation patterns, such broadcasts and reduction operations, over sparse data arrays 
according to the stored mapping. The halo data exchange pattern is derived by DM- 
Plex from an internal SF encapsulating the overlap in the topology graph and a given 
discretisation provided in the form of a section object. The derived SF encapsulates 
a local-to-local remote data mapping that avoids the need to convert halo data into 
a global numbering. 

4. Application orderings. When deriving function spaces from a DMPlex 
topology definition, the global data layout is inherited from the original graph or¬ 
dering generated by PETSc. PyOP2, however, imposes a data layout restriction that 
allows it to optimise performance by overlapping computation with communication, 
which is not honoured in the global entity numbering generated by DMPlex. Fire- 
drake therefore generates an application ordering in the form of a permutation of the 
DAG points that is passed to a distributed DMPlex object to generate indirection 
maps that adhere to its required ordering. 

4.1. PyOP2 data ordering. To ensure that halo exchange communication can 
be overlapped with assembly kernel computation PyOP2 sets require a strict entity 
ordering, where non-owned data is stored contiguously at the end of the data array. 
Moreover, as shown in Fig. 3, all owned data items adjacent to non-owned items 
require that the halo data exchange be finished before computation is performed. 
Thus, owned data is further partitioned into core (independent of halo) and non¬ 
core (halo-dependent) data, allowing processing over core data items to proceed while 
communication is still in-flight. 

Firedrake honours the PyOP2 entity ordering by assigning all points in the DM¬ 
Plex topology DAG to one of the PyOP2 entity classes using a DMLabel data structure, 
which encapsulates integer value assignments to points. When deriving the indirection 
maps for each discretisation, mesh entities can then be filtered into the appropriate 
sets regardless of entity type. The algorithm used to mark PyOP2 entity classes is 
shown in Alg. 2, where the initial overlap definition, provided by DMPlex in form of 
an SF, is used to first mark the halo region, followed by the derivation of adjacent 
non-core points. 
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Partition 1 


Fig. 3: PyOP2 entity classes on a distributed 4x4 unit square mesh. The dark region 
marks core entities, medium grey marks non-core entities and light grey marks the 
halo region. 


Algorithm 2 Algorithm to mark PyOP2 entity classes on DMPlex based on the 
initial halo definition given by DMPlex. Point adjacency in the DAG is defined as 
adjacency (jp) = closure{star(j>)). 


1 

for p in pointsF do 

> Define halo region from SF 

2 

LABELSETVALUE(/ia;o, p) 


3 

for P in LABELGETSTRATUM(/ia/o) do 

> Loop over halo cells 

4 

if p in HEiGHTSTRATUM(pZea::, 0) then 


5 

adjacency DMPLEXGETADJACENCY(pZea;,p) 

6 

for c in adjacency do 

> Find cells adjacent to halo 

7 

if LABELHASPoiNT(/iaZo, c) and c 

in HEiGHTSTRATUM(p/ea;, 0) then 

8 

LABELSETVALUE(noncore, p) 

[> Mark adjacent cell as non-core 

9 

for p in mesh do 

[> Mark remaining points as core 

10 

if not LABELHASPoiNT(/iaZo, p) and not LABELHASPoiNT(noncore, p) then 

11 

LABELSETVALUE(core, p) 



4.2. Compact RCM ordering. The generic encapsulation of mesh topology 
allows DMPlex to compute the point permutation according to the well-known RCM 
mesh reordering algorithm (see section 2.3). Since Firedrake already controls the ef¬ 
fective ordering of mesh entities to adhere to PyOP2 ordering restrictions, the RCM 
permutation provided by DMPlex can be applied to the Firedrake-specific point per¬ 
mutation. However, any additional indirection applied to the reordering permutation 
computed by Firedrake needs to be contained within the marked PyOP2 class re¬ 
gions. Thus, although the base RCM permutation generated by DMPlex includes 
all graph points, Firedrake implements a cell-wise compact reordering, where the cell 
ordering is filtered from the RCM permutation within each marked PyOP2 region. 
As shown in Alg. 3, the full permutation is then derived by adding cell closures along 
the segmented cell order, ensuring the relative compactness of DoFs associated with 
the same cell. 

It is worth noting that this cell-wise compact reordering approach allows any 
additional level of indirection to be applied without violating the PyOP2 ordering 



























M. Lange et al. 


Algorithm 3 Algorithm for generating a compact RCM permutation that honours 
PyOP2’s entity class separation and encapsulates a cell-wise RCM reordering within 
the PyOP2 regions. 

1; ordering DMPLEXGETORDERiNG(i?C'M) > Get RCM renumbering 

2: for class in {core, noncore, halo} do > Get array index for each class 

3: idXclass t— LABELSTRATUMSlZE(c?aSs) 

4; for p in mesh do 
5: Prcm t— ordering{p} 

6; if Prcm not in HEiGHTSTRATUM(pZea;, 0) then skip p 

7; for class in {core, noncore, halo} do > Get array index for current class 
8; if LABELHASPoiNT(c/ass,p) then idx ^ idxciass 

9: for Pclosure in DMPLEXGETGLOSURE(p/ex, Prcm) do 

10: permutation{idx} •(— Pdosure 


constraint and is therefore not limited to RCM. Examples of sparse matrix structures 
generated using the RCM-based reordering are shown in Fig. 4. 

5. Performance benchmarks. The benefits of Firedrake’s compact RCM mesh 
reordering have been evaluated using two sets of performance benchmarks: a run-time 
comparison of assembly loops over cells and interior facets with light-weight kernels, 
as well as solving a full advection-diffusion problem. The benchmark experiments 
were carried out on the UK national supercomputer ARCHER, a Cray XE30 with 
4920 nodes connected via an Aries interconnect and a parallel Lustre filesystem 
Each node consists of two 2.7 GHz, 12-core Intel E5-2697 v2 (Ivy Bridge) processors 
with 64GB of memory. 

An indication of the indirection cost and subsequent data traversal performance 
in low-level loops was gained by comparing the individually measured execution time 
of two PyOP2 assembly loops. The benchmark loops were generated by invoking 
assemble(L) 100 times for the UFL expressions L = u*dx and L = u(’ + ’)*dS for cell 
and interior facet integrals respectively, where u is a suitable Function object. The 
performance of a full-scale finite element problem was then analysed, which consisted 
of assembling and solving the advection-diffusion equation ff + V (^c) = V V 
c) using a Conjugate Gradient method with a Jacobi preconditioner for advection 
and the HYPRE BoomerAMG algebraic multigrid preconditioner [ 8 ] for the diffusion 
component. The mesh used in both experiments represents a two-dimensional L- 
shaped domain, consisting of 3,105,620 cells and 1,552,808 vertices, and was generated 
with Gmsh [10]. 

The performance of the assembly loops over cells and interior facets using Pi 
and P 3 function spaces on up to 96 cores is shown in Fig. 5. The performance of 
the cell integral loop shows significant improvements in both cases, whereas the facet 
integral loop shows a small performance decrease. This highlights that the compact 
RCM reordering optimises cell integral computation due to the generated cell-wise 
compact traversal pattern. It is also worth noting that the improvement due to 
RCM diminishes as we approach the strong scaling limit, although an increase in 
computational intensity between Pi and P 3 assembly kernels negates this effect. 

A performance profile of the full advection-diffusion model is given in Fig. 6 . Ma- 


^http://www.archer.ac.uk/ 
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trix and RHS assembly times indicate clear performance improvements under compact 
RCM with significant speedups for Pi on small numbers of cores (see Fig. 6 a). As 
shown in Fig. 6 b, P 3 assembly kernels with a higher computational intensity also show 
significant performance improvements, where matrix assembly in particular benefits 
from the reordering in a sustained way up to 96 cores. Similarly, advection and dif¬ 
fusion solver times shown in Fig. 6 c and Fig. 6 d indicate a clear speedup on small 
numbers of cores, while significant improvements are also evident on up to 96 cores 
for solves with larger numbers of DoFs in P 3 . 

6. Discussion. In this paper we give a full account of the utilisation of PETSc’s 
DMPlex topology abstraction in Firedrake to derive the topological mapping required 
to solve a wide range of finite element problems. We highlight how the right com¬ 
position of abstractions can be used to apply well known data layout optimisations, 
such as RCM renumbering, to an entire class of problems and demonstrate the result¬ 
ing gains in assembly and solver performance. Our work emphasises the importance 
of high-level DSLs and further underlines their potential for achieving performance 
portability through run-time optimisation. 

An important corollary of the close integration of DMPlex into the Firedrake 
framework is the improved interoperability and extensibility of the mesh management 
component. Future efforts to improve file I/O and add new meshing capabilities, such 
as mesh adaptivity, can now be integrated through PETSc DMPlex interfaces. This 
ensures that computational models built using the Firedrake framework can easily be 
extended without breaking existing abstractions and thus enables domain scientists to 
leverage automated performance optimisations as well as a wide range of simulation 
features. 
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Fig. 4: Effects of the combined RCM and OP2 mesh ordering on matrix structure for 
a Pi and a P 3 function space on a 5 x 5 unit square. 
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(a) PyOP2 loop performance for Pi 


(b) PyOP2 loop performance for P 3 


Fig. 5: Run-time comparison between compact RCM and native numbering for as¬ 
sembly loops over cells and interior facets. 




(a) Assembly performance for Pi (b) Assembly performance for P3 



(c) Solver performance for Pi (d) Solver performance for P3 


Fig. 6 : Run-time comparison between compact RCM and native numbering for the 
advection-diffusion problem on Pi and P 3 discretisations. 









